Abstract

The haplotype resolution from xor-genotype data has been recently 
formulated as a new model for genetic studies [5]. The xor-genotype 
data is a cheaply obtainable type of data distinguishing heterozygous 
from homozygous sites without identifying the homozygous alleles. In 
this paper we propose a formulation based on a well-known model 
used in haplotype inference: pure parsimony. We exhibit exact solu- 
tions of the problem by providing polynomial time algorithms for some 
restricted cases and a fixed-parameter algorithm for the general case. 
These results are based on some interesting combinatorial properties 
of a graph representation of the solutions. Furthermore, we show that 
the problem has a polynomial time k-approximation, where k is the
maximum number of xor-genotypes containing a given SNP. Finally, 
we propose a heuristic and produce an experimental analysis showing 
that it scales to real-world large instances taken from the HapMap 
project. 



1 Introduction 

In this paper we investigate a computational problem arising in genetic stud- 
ies of diploid organisms. In such organisms (which include all vertebrates), 
all chromosomes are in two copies, one inherited from the mother and one 
from the father. Since chromosomes are almost identical except for specific 
gene variants called Single Nucleotide Polymorphisms (or SNPs), changes 
between variants are represented by a sequence of sites, each one bearing 
a specific value called allele. In almost all cases, for each site at most two 
different alleles are present in the population, one of which is called major 
and the other one minor. The sequence of alleles along a chromosome is 
called a haplotype, while a genotype is a sequence of unordered pairs of alleles
that appear in each site of the two copies of the chromosome. Haplotype 
data are crucial in genetic population studies. The current technology for 
finding the two haplotypes of an individual is too expensive to be used in 
genetic studies of a population. Fortunately it is much cheaper to deter- 
mine the xor-genotype for each individual, that is the set of sites for which 
the individual is heterozygous, i.e. bearing both a major and a minor al- 
lele. The term xor-genotype derives from the fact that a site is reported in 
the genotype if and only if the two alleles in the site are different. Thus 
a xor-genotype lists only heterozygous sites, while excluding sites bearing 
identical alleles, called homozygous sites. 

The problem of reconstructing the haplotypes resolving a given set of xor- 
genotypes naturally arises and represents an interesting case of the process 
of inferring haplotypes from general genotypes (phasing).

Polynomial time algorithms for the problem have been developed [3]
in the framework of the Perfect Phylogeny model, originally proposed by
Gusfield [12] to solve the phasing problem.

In this paper, we investigate the problem under the parsimony principle,
which asks for a smallest set of haplotypes resolving all input
xor-genotypes; this problem will be called Pure Parsimony Xor Haplotyping
(PPXH).

Let $S$ be a set of sites (also called characters). Then a xor-genotype
(or simply a genotype) $x$ is a non-empty subset of $S$, and a haplotype $h$ is a
(possibly empty) subset of $S$. Given two distinct haplotypes $h_1$, $h_2$, the
pair $(h_1, h_2)$ resolves the xor-genotype $x$ iff $x = h_1 \oplus h_2$, where $\oplus$ is
the classical symmetric difference of $h_1$ and $h_2$, i.e. the set of characters
that are present in exactly one of $h_1$ and $h_2$. A set $H$ of haplotypes resolves
a set $X$ of xor-genotypes if, for each genotype $x \in X$, there exists a pair of
haplotypes in $H$ that resolves $x$.
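The definition of resolution can be checked mechanically. The following is a minimal sketch, assuming haplotypes and genotypes are given as Python sets of character names; the function name `resolves` is introduced here for illustration only.

```python
from itertools import combinations


def resolves(haplotypes, genotypes):
    """Return True if every genotype is the symmetric difference of two distinct haplotypes."""
    # Deduplicate and freeze the haplotypes so that they can be compared as sets.
    hap_set = {frozenset(h) for h in haplotypes}
    # All symmetric differences h1 xor h2 over unordered pairs of distinct haplotypes.
    resolvable = {h1 ^ h2 for h1, h2 in combinations(hap_set, 2)}
    return all(frozenset(x) in resolvable for x in genotypes)
```

For instance, the three haplotypes `set()`, `{"a"}` and `{"a", "b"}` resolve the genotypes `{"a"}`, `{"b"}` and `{"a", "b"}`.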

We are now able to formally introduce the problem that we will study 
in this paper. 

Problem 1. Pure Parsimony Xor Haplotyping (PPXH). The instance
of the problem is a set $X$ of xor-genotypes, and the goal is to compute
a smallest set $H$ of haplotypes resolving $X$.

The pure parsimony model has been used as an approach to the phasing 
problem over regular genotypes (i.e. where alleles of each homozygous site 
are specified) [13]. There is a rich literature in this area; in particular 
the APX-hardness [15] of the problem and the lack of good approximation 
guarantees have led many researchers to the design of methods based on 
linear programming techniques to find solutions of the problem [6]. Indeed, 
the best known approximation algorithm yields an approximation guarantee
of $2^{k-1}$, where $k$ is the maximum number of heterozygous sites appearing in
each genotype [15]. Restricted cases of the problem with polynomial time
solutions have been studied [21].

In the paper we investigate the PPXH problem mainly by devising exact 
solutions of the problem by either considering fixed-parameter tractability 
or polynomial time algorithms for some restricted instances of the problem. 
We introduce a new graph representation of xor-genotypes and haplotypes, 
called xor-graph, that is crucial in the study of the PPXH problem. Indeed 
most of the results that we will present rely on combinatorial properties of 
xor-graphs. 

Initially we will show that the PPXH problem is equivalent to the prob- 
lem of building a xor-graph with the fewest possible vertices. Afterwards 
we design two polynomial time solutions for restricted instances of the 
PPXH problem. Subsequently we design a fixed-parameter algorithm of
$O(mn + 2^{k^2} km)$ time complexity, for $k$ the size of the optimum solution.
Moreover we provide an $l$-approximation algorithm, where $l$ is the maximum
number of occurrences of a character in the set of input genotypes. Finally
we propose a heuristic for the general problem and an experimental analysis 
on real and artificial datasets. The experimental analysis shows that the 
heuristic is effective on a large class of instances of various sizes where other
methods, such as the ILP formulation proposed by Brown and Harrower [6], 
are not applicable. 

2 Basic Properties 

A fundamental idea used in our paper is a graph representation of a feasible
solution. More precisely, given a set $X$ of xor-genotypes, the representation
of a set $H$ of haplotypes resolving $X$ is the graph $\mathcal{G} = (H, E)$, called the
xor-graph associated with $H$, where the edges of $\mathcal{G}$ are labeled by a bijective function
$\lambda : E \to X$ such that, for each edge $e = (h_i, h_j)$, $\lambda(e) = h_i \oplus h_j$. The labeling
$\lambda$ is generalized to a set $C$ of edges by defining $\lambda(C) = \{\lambda(e) \mid e \in C\}$. We call optimal
xor-graph for $X$ a xor-graph associated with an optimal solution for $X$ (that
is, a xor-graph with the minimum number of vertices).

In this section we state some basic combinatorial properties of xor-graphs 
that will be used to prove the main results of the paper. Among all possible 
haplotypes, we identify a distinguished haplotype, called the null haplotype and
denoted by $h_0$, which corresponds to the empty set. Since the operation $\oplus$
is associative and commutative, by a slight abuse of language, given a family
$F = \{s_1, \ldots, s_n\}$ of subsets of a set $S$ we denote by $\oplus(F)$ the expression

$s_1 \oplus s_2 \oplus \cdots \oplus s_n$.

The cycles of a xor-graph satisfy the following property. 

Lemma 2.1. Let $X$ be a set of xor-genotypes, let $\mathcal{G}$ be a xor-graph associated
with a set of haplotypes resolving $X$ and let $C$ be the edge set of a cycle of
$\mathcal{G}$. Then $\oplus(\lambda(C))$ is equal to the empty set.

Proof. By definition of cycle, $C$ consists of a set $\{(h_1, h_2), (h_2, h_3), \ldots, (h_n, h_{n+1})\}$,
with $h_1 = h_{n+1}$. By definition of xor-graph, $\oplus(\lambda(C)) = \bigoplus_{i=1}^{n} (h_i \oplus h_{i+1})$.
By the associativity and commutativity of $\oplus$, $\oplus(\lambda(C)) = (h_1 \oplus h_{n+1}) \oplus
(h_2 \oplus h_2) \oplus \cdots \oplus (h_n \oplus h_n)$. Since $h_1 = h_{n+1}$ and $h \oplus h = \emptyset$ for each $h$, we
obtain $\oplus(\lambda(C)) = \emptyset$. □

The above property of cycles of a xor-graph is sufficient to construct
a set of haplotypes resolving a set of genotypes from a xor-graph. Let $X$
be an instance of PPXH and let $\mathcal{G} = (V, E)$ be a graph whose edges are
bijectively labeled by a function $\lambda : E \to X$ such that $\oplus(\lambda(C)) = \emptyset$
for each cycle $C$ of the graph. Then it is immediate to compute a feasible
solution $H$ from $\mathcal{G}$ where $|H| \le |V|$. More precisely, we associate a haplotype
with each vertex of $\mathcal{G}$ as follows. Associate the null haplotype $h_0$ with an arbitrary
vertex in each connected component of $\mathcal{G}$. Perform a depth-first visit of
each connected component of $\mathcal{G}$, starting from the vertex associated with
$h_0$. When visiting a new vertex $v$ of $\mathcal{G}$ there must exist an edge $e = (v, w)$
such that a haplotype $w_h$ has been previously assigned to $w$. Then associate
the haplotype $w_h \oplus \lambda(e)$ with $v$.

It is not hard to verify that our construction guarantees that $H$ is actually
a feasible solution of $X$, that is, for each edge $e = (v, w)$ of $\mathcal{G}$, $v_h \oplus w_h = \lambda(e)$,
where $v_h$ and $w_h$ are respectively the haplotypes associated with $v$ and $w$.
The property trivially holds for all edges that are part of
the spanning forest computed by the depth-first search. Therefore we can
restrict our attention to edges $e = (v, w)$ that are not in such spanning forest.
Since $v$ and $w$ are in the same connected component of $\mathcal{G}$, the spanning tree
$T$ of the connected component contains both $v$ and $w$. Let $x$ be the least
common ancestor of $v$ and $w$ in $T$. By construction the two paths of $T$, both
starting from $x$ and ending one in $v$ and the other in $w$, are edge disjoint. Let
us denote by $P_v$ and $P_w$ respectively the edges of the paths ending in $v$ and
$w$, and let $x_h$ be the haplotype associated with $x$. Now we want to prove
that $v_h \oplus w_h = \lambda(e)$. It is immediate to verify that $v_h = \oplus(\lambda(P_v)) \oplus x_h$
and $w_h = \oplus(\lambda(P_w)) \oplus x_h$. Since the edges in $P_v \cup P_w \cup \{e\}$ form a simple
cycle of $\mathcal{G}$, by Lemma 2.1 we can conclude $\lambda(e) = \oplus(\lambda(P_v)) \oplus \oplus(\lambda(P_w)) = v_h \oplus w_h$,
completing the proof.
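The construction just described can be sketched in a few lines. This is only an illustration under stated assumptions: the graph is given as a list of vertices plus a list of `(u, v, label)` triples, the labels already satisfy the cycle property of Lemma 2.1, and the name `haplotypes_from_graph` is hypothetical. A stack-based traversal is used in place of a recursive depth-first visit; the argument above only needs a spanning forest.

```python
def haplotypes_from_graph(vertices, labeled_edges):
    """Assign a haplotype (frozenset) to every vertex of an edge-labeled graph."""
    # Build an adjacency list that keeps the genotype label of each edge.
    adjacency = {v: [] for v in vertices}
    for u, v, label in labeled_edges:
        label = frozenset(label)
        adjacency[u].append((v, label))
        adjacency[v].append((u, label))

    assigned = {}
    for root in vertices:
        if root in assigned:
            continue
        # One vertex per connected component receives the null haplotype.
        assigned[root] = frozenset()
        stack = [root]
        while stack:
            w = stack.pop()
            for v, label in adjacency[w]:
                if v not in assigned:
                    # New vertex: haplotype of the visited neighbour xor the edge label.
                    assigned[v] = assigned[w] ^ label
                    stack.append(v)
    return assigned
```

On a triangle whose edges are labeled `{a}`, `{b}` and `{a, b}`, the three vertices receive the haplotypes `{}`, `{a}` and `{a, b}`, and every edge label equals the symmetric difference of its endpoints.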

The following results justify our attention to connected xor-graphs and 
their cuts. 

Lemma 2.2. Let $X$ be a set of xor-genotypes and let $\mathcal{G}$ be a xor-graph
associated with a set $H$ of haplotypes resolving $X$. Let $a$ be any character
of $S$. Then the set $A$ of edges of $\mathcal{G}$ whose label contains $a$ is a cut of $\mathcal{G}$.

Proof. Let $H_a$ be the subset of haplotypes of $H$ containing the character $a$, and let $\bar{H}_a =
H \setminus H_a$. Let $E'$ be the set of edges of $\mathcal{G}$ with one endpoint in $H_a$ and one in
$\bar{H}_a$ (clearly $E'$ is a cut of $\mathcal{G}$). Notice that $E'$ is exactly the set of edges
connecting a haplotype containing $a$ and a haplotype not containing $a$,
therefore $E' = A$. □



Lemma 2.3. Let $X$ be a set of xor-genotypes, and let $\mathcal{G} = (H, E)$ be a
disconnected xor-graph for $X$. Then $\mathcal{G}$ is not an optimal xor-graph of $X$.

Proof. Since $\mathcal{G}$ has at least two connected components $C_1$ and $C_2$, we denote
by $a_1$, $a_2$ two vertices from $C_1$ and $C_2$ respectively. Construct the set $H'$
from $H$ by replacing each haplotype $h \in C_1$ by $h \oplus a_1$ and each haplotype
$h \in C_2$ by $h \oplus a_2$. Since $C_1$ and $C_2$ are not connected, the set of genotypes
resolved by $H'$ is equal to that of $H$.

But both $a_1$ and $a_2$ are replaced by the null haplotype in $H'$, therefore
$|H'|$ is strictly smaller than $|H|$. □

Instances and solutions of the PPXH problem can be represented by 
binary matrices. More precisely, we can have a genotype matrix associated 
with a set of xor-genotypes and a haplotype matrix associated with a set 
of haplotypes. In both matrices each column is uniquely identified by a 
character in S, while the rows of a genotype matrix (respectively haplotype 
matrix) correspond to the genotypes (resp. haplotypes). 

For example let $S$ be the set $\{a, b, c, d, e\}$ and let $X$ be the set of
xor-genotypes $\{\{a, b\}, \{a, b, c\}, \{b, c\}, \{c, d, e\}, \{a\}, \{e\}, \{a, c, e\}\}$. A possible,
albeit suboptimal, set of haplotypes resolving $X$ is $\{\emptyset, \{c, d\}, \{a, b, c, d\}, \{a, c\}, \{a, d\}, \{d\}, \{e\}\}$.
The matrix representation of both sets is in Table 1, while the associated
xor-graph is represented in Figure 1.



Table 1: Example of genotype (left) and haplotype (right) matrices. 

Given an ordering of the character set (that is $S = (\sigma_1, \ldots, \sigma_{|S|})$), the
entry in the $i$-th row and $j$-th column of a genotype matrix (respectively,
haplotype matrix) is 1 if $\sigma_j$ belongs to the $i$-th genotype (respectively,
$i$-th haplotype) and is equal to 0 otherwise. In the following we identify
rows of a genotype (or haplotype) matrix with the corresponding genotypes
(or haplotypes). Given a matrix $M$, we denote by $M[\cdot, A]$ (by $M[B, \cdot]$,
respectively) the submatrix of $M$ induced by the set $A$ of columns (by the
set $B$ of rows, respectively).

Figure 1: Xor-graph representing the set of haplotypes in Table 1.

Given a genotype or haplotype matrix $M$ over $S$, we will say that a subset
$S_1$ of $S$ is a linearly dependent set of characters (or, simply, a dependent set
of characters) in matrix $M$ if there exists a non-empty subset $S_2$ of $S_1$
such that, for each row $i$, $\bigoplus_{\sigma \in S_2} M[i, \sigma] = 0$. Otherwise it is called linearly
independent (or simply independent).

While solving the PPXH problem, we can restrict our attention to a 
maximal independent subset of characters as stated in the following lemma. 

Lemma 2.4. Let $X$ be a xor-genotype matrix and $H$ be a haplotype matrix
over the same character set $S$. Let $S_1$ be a maximal independent subset of
$S$ in $X$. Then, $H$ resolves $X$ if and only if $H[\cdot, S_1]$ resolves $X[\cdot, S_1]$.

Proof. The only-if part is obviously true because $H[\cdot, S_1]$ and $X[\cdot, S_1]$ are
two submatrices of $H$ and $X$ respectively. The if part can be proved by
constructing a feasible solution $H$ for $X$ from the smaller solution $H[\cdot, S_1]$
for $X[\cdot, S_1]$ (for simplicity we will refer to the two submatrices respectively
as $H'$ and $X'$). For each character $a \in S \setminus S_1$, since $S_1 \cup \{a\}$ is dependent
there exists a non-empty subset $S_a$ of $S_1$ such that, for each genotype $x$,
$X[x, a] = \bigoplus_{\sigma \in S_a} X'[x, \sigma]$. For each haplotype $h$, set the entry $H[h, a]$ to $\bigoplus_{\sigma \in S_a} H'[h, \sigma]$.

We claim that $H$ resolves $X$. Since $H'$ resolves $X'$, it suffices to prove
that for each character $a \in S \setminus S_1$, $H[h_1, a] \oplus H[h_2, a] = X[x, a]$, for some
pair of haplotypes $h_1$, $h_2$. We already know that for each genotype $x'$ of
$X'$, there is a pair $(h'_1, h'_2)$ of haplotypes in $H'$ that resolves $x'$. Notice
that $X[x, a] = \bigoplus_{\sigma \in S_a} X'[x, \sigma]$ since $S_1$ is a maximal subset of independent
characters of $X$. Since $H'$ resolves $X'$, $\bigoplus_{\sigma \in S_a} X'[x, \sigma] = \bigoplus_{\sigma \in S_a} (H'[h_1, \sigma] \oplus
H'[h_2, \sigma])$. Moreover, by the associativity and commutativity of $\oplus$, $\bigoplus_{\sigma \in S_a} (H'[h_1, \sigma] \oplus H'[h_2, \sigma]) =
(\bigoplus_{\sigma \in S_a} H'[h_1, \sigma]) \oplus (\bigoplus_{\sigma \in S_a} H'[h_2, \sigma])$. Finally, by our construction of the
columns of $H$ corresponding to characters in $S \setminus S_1$, $(\bigoplus_{\sigma \in S_a} H'[h_1, \sigma]) \oplus
(\bigoplus_{\sigma \in S_a} H'[h_2, \sigma]) = H[h_1, a] \oplus H[h_2, a]$, hence completing the proof. □

Notice that, given an $n \times m$ xor-genotype matrix $X$, a maximal subset
of independent characters in $X$ can be extracted by applying the
Gauss-elimination algorithm on the matrix $X$ in $O(nm^2)$ time. Observe that the
proof of Lemma 2.4 is constructive and shows how to compute efficiently a
solution $H$ for $X$ from a solution $H[\cdot, S_1]$ for $X[\cdot, S_1]$. We can introduce
another simplification of the instance which can be performed efficiently. It
affects the construction of the xor-graph and allows an efficient reconstruction
of an optimal xor-graph for the general instance, given a xor-graph for
the reduced or simplified instance.
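As an illustration of the extraction step, the sketch below selects a maximal set of linearly independent columns over GF(2). It is not taken from the paper: it assumes the genotype matrix is a list of 0/1 rows, encodes every column as an integer bitmask, and uses an incremental xor basis, which is one way to carry out Gauss elimination. The name `independent_columns` is hypothetical.

```python
def independent_columns(matrix):
    """Return the indices of a maximal set of GF(2)-independent columns of a 0/1 matrix."""
    if not matrix:
        return []
    n_rows, n_cols = len(matrix), len(matrix[0])
    basis = {}    # highest set bit -> reduced column vector having that leading bit
    chosen = []
    for j in range(n_cols):
        # Encode column j as an integer: bit i is the entry in row i.
        vec = 0
        for i in range(n_rows):
            if matrix[i][j]:
                vec |= 1 << i
        # Reduce the column against the basis built from the earlier columns.
        while vec:
            top = vec.bit_length() - 1
            if top not in basis:
                # Non-zero remainder: column j is independent of the chosen ones.
                basis[top] = vec
                chosen.append(j)
                break
            vec ^= basis[top]
    return chosen
```

In the matrix with rows `[1, 1, 0]`, `[0, 1, 1]`, `[1, 0, 1]` the third column is the xor of the first two, so only columns 0 and 1 are kept.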

Lemma 2.5. Let $X$ be an instance of PPXH, and let $a$ be a character of $X$
such that there exists exactly one genotype $x \in X$ with $a \in x$. Then there
exists an optimal xor-graph $\mathcal{G}$ for $X$ such that there is a vertex $v$ of $\mathcal{G}$ with
exactly one edge $e$ incident on $v$ and $\lambda(e) = x$.

Proof. Let $\mathcal{G}$ be an optimal xor-graph for $X$. Since $a$ appears in only one
genotype in $X$, there is exactly one edge $e$ of $\mathcal{G}$ such that $a \in \lambda(e)$. By
Lemma 2.2 removing $e$ from $\mathcal{G}$ results in a bipartition $\{H_a, \bar{H}_a\}$ where $H_a$
consists of the haplotypes containing $a$. Let $v \in H_a$ and $w \in \bar{H}_a$ be the two
endpoints of $e$, and let $D$ be the set of vertices of $H_a$ adjacent to $v$. Change
each haplotype $h \in H_a \setminus \{v\}$ to $h \oplus v$ obtaining a new xor-graph $\mathcal{G}_1$.

By construction, $\mathcal{G}_1$ has set of edges $E_1 = E \setminus \{(v, d) \mid d \in D\} \cup \{(w, d) \mid
d \in D\}$. Indeed, let $e = (v, d)$ be any edge of $\mathcal{G}$ connecting $v$ with a
vertex $d \in D$; in $\mathcal{G}_1$ there is an edge $f = (w, d)$ such that $\lambda(e) = \lambda(f)$.
It is immediate to notice that $\mathcal{G}$ and $\mathcal{G}_1$ have the same number of vertices,
therefore $\mathcal{G}_1$ is optimal and satisfies the statement of the lemma. □



The proof of Lemma 2.5 is also constructive and can be exploited directly
in an algorithm to simplify the instance of the problem. More precisely, the
removal of a genotype and a character as stated by Lemma 2.5 can be
repeated until all characters appear in at least two genotypes (or we obtain
the special case of an instance containing only one genotype; in such case
the optimal solution trivially consists of two haplotypes). Following the
same idea, if the set of characters is linearly dependent, we can extract a 
maximal subset of linearly independent characters. Moreover the executions 
of the two reductions can be intertwined until none of those reductions can 
be performed. 

An instance $X$ of PPXH is called reduced if (i) $X$ consists of only one
genotype, or the two following conditions are satisfied: (ii a) the set of
characters of $X$ is independent and (ii b) each character appears in at least
two genotypes. Lemmas 2.4 and 2.5 justify the fact that we will assume in
the rest of the paper that all instances are reduced, as the reduction process
can be performed efficiently, and we can easily compute a solution of the
original instance given a solution of a reduced instance (see Algorithm 1 for
a more detailed description of the reduction process).

Algorithm 1: The reduction step

Data: a xor-genotype matrix $X$
Result: a reduced xor-genotype matrix associated with the input matrix $X$

repeat
    $C \leftarrow$ a maximal subset of linearly independent columns of $X$ obtained by the
        Gauss-elimination algorithm;
    $D \leftarrow X \setminus C$;
    $A \leftarrow$ the set of symbols appearing in exactly one genotype in $X$;
    Remove from $X$ all columns in $D$ and all rows containing a symbol of $A$;
until $D$ and $A$ are both empty;
return $X$;



The reduction process leads us to an important lower bound on the size 
of the optimum. 

Lemma 2.6. Let X be a reduced genotype matrix having n rows and m 
columns. Then any haplotype matrix H resolving X has at least m + 1 rows. 



Proof. Let $\mathcal{G}$ be a xor-graph for $X$. By Lemma 2.2 each character $a$ induces
a cut in graph $\mathcal{G}$. Each cut can be represented as an $n$-bit binary vector $c_a$ in
which each element $c_a[i]$ is equal to 1 if and only if the genotype $x_i$ belongs
to the cut. Clearly, such vector is precisely the column vector corresponding
to character $a$ of matrix $X$. Thus, since the characters are independent, also
the family of the cuts (represented as binary vectors) induced by the set of
characters is linearly independent. By Theorem 1.9.6 of [7] all connected
graphs with $m$ independent cuts have at least $m + 1$ vertices. □



As a consequence of Lemma 2.6, in a reduced xor-genotype matrix, the
number of rows is greater than or equal to the number of columns. In
fact, in any matrix, the number of linearly independent columns is equal
to the number of linearly independent rows and, clearly, is bounded by the
minimum of the number of columns and the number of rows.

The process of reducing a xor-genotype matrix by restricting ourselves to 
a maximal subset of independent characters is an application of the kernel- 
ization technique for designing a fixed-parameter algorithm [8]. The tech- 
nique consists of reducing the original instance to a new instance whose size 
depends only on the parameter (in our case the size of the optimal solution).
The size of the reduced xor-genotype matrix is clearly bounded by a polynomial
function of the optimum $k$ since at most $O(k^2)$ distinct genotypes can
be generated by $k$ distinct haplotypes and, by the previous consideration,
the number of columns is at most the number of rows. As a result, the
number of entries of a reduced xor-genotype matrix is bounded by $O(k^4)$.

3 Algorithms for Restricted Instances



In this section we investigate two restrictions of the PPXH problem obtained 
by bounding the number of characters that can appear in each genotype and 
the number of genotypes where a character can occur. Those restrictions 
are summarized by the following formulation. 

Problem 2. Constrained Pure Parsimony Xor Haplotyping (PPXH($p$, $q$)).
The instance consists of a set $X$ of xor-genotypes, where each xor-genotype
$x \in X$ contains at most $p$ characters, and each character appears in at most
$q$ xor-genotypes. The goal is to compute a minimum cardinality set $H$ of
haplotypes that resolves $X$. We use the symbol $\infty$ when one of the parameters
$p$ or $q$ is unbounded.

More precisely we will present efficient algorithms for the case when each
character is contained in at most two xor-genotypes (PPXH($\infty$, 2)) and the
case when each genotype consists of at most two characters (PPXH(2, $\infty$)).

3.1 A Polynomial Time Algorithm for PPXH($\infty$, 2)

The structure of the cycles in a xor-graph characterizes the solutions for the
PPXH($\infty$, 2) problem as stated in the following Lemma.

Lemma 3.1. Let $X$ be a reduced instance of PPXH($\infty$, 2), let $\mathcal{G}$ be an
optimal xor-graph for $X$, and let $e$ be an edge of $\mathcal{G}$. Then $e$ belongs to
exactly one simple cycle of $\mathcal{G}$.

Proof. Assume to the contrary that an edge $e$ belongs to two cycles $C_1$
and $C_2$. Notice that the three sets $C_1 \setminus C_2$, $C_2 \setminus C_1$, $C_1 \cap C_2$ are pairwise
disjoint and not empty. Let $d$ be any element of $\lambda(C_1 \cap C_2)$. By Lemma 2.1
$\oplus(\lambda(C_1 \setminus C_2)) = \oplus(\lambda(C_2 \setminus C_1)) = \oplus(\lambda(C_1 \cap C_2))$. Consequently there exist
three distinct edges $e_1 \in C_1 \setminus C_2$, $e_2 \in C_2 \setminus C_1$, $e_3 \in C_1 \cap C_2$ such that $\lambda(e_1)$,
$\lambda(e_2)$, $\lambda(e_3)$ all contain $d$, which contradicts the fact that there are only two
genotypes containing $d$. By the first part of the proof, we have now to prove
that $e$ belongs to at least one cycle of $\mathcal{G}$.

Assume to the contrary that $X$ is a smallest counterexample, that is no
such xor-graph exists for $X$, while such graph exists for all reduced instances
with fewer genotypes, and let $\mathcal{G}$ be any optimal xor-graph for $X$. Since there
is an edge that does not belong to any cycle of $\mathcal{G}$, there is a character $a$ such
that both edges $e_1$ and $e_2$ containing $a$ do not belong to any cycle. Notice
that two such edges must exist, since the instance is reduced. Let us denote
by $\alpha$, $\beta$ and $\gamma$ respectively the sets $\lambda(e_1) \cap \lambda(e_2)$, $\lambda(e_1) \setminus \lambda(e_2)$, $\lambda(e_2) \setminus \lambda(e_1)$.

Compute a new reduced instance $X_1$ from $X$ by removing the
xor-genotypes $e_1$ and $e_2$, and adding a new genotype $x_c = \beta \cup \gamma = e_1 \oplus e_2$.
Clearly $X_1$ is a reduced instance of PPXH($\infty$, 2) smaller than $X$, therefore
$X_1$ admits an optimal graph $\mathcal{G}_1$ where all edges are in some cycle. Let
us consider the unique cycle $C$ of $\mathcal{G}_1$ containing the edge $e = (u, v)$, with
$\lambda(e) = x_c$. Now, starting from $\mathcal{G}_1$, compute a xor-graph $\mathcal{G}_2$ for instance $X$,
by adding to $\mathcal{G}_1$ a new vertex $w$ and two edges $e'_1 = (u, w)$, $e'_2 = (w, v)$, so that
$\lambda(e'_1) = e_1$ and $\lambda(e'_2) = e_2$, and by removing edge $e$. The graph $\mathcal{G}_2$ is a
xor-graph of $X$ as $x_c = e_1 \oplus e_2$. Clearly the newly obtained graph $\mathcal{G}_2$ is a
xor-graph for $X$ satisfying the requirements of the lemma and $\mathcal{G}_2$ contains
one more vertex than $\mathcal{G}_1$.

We have to prove that $\mathcal{G}_2$ is optimal, therefore assume that $\mathcal{G}_2$ is not
optimal and let $\mathcal{G}^*$ be an optimal xor-graph for $X$, that is $\mathcal{G}^*$ has no more
vertices than $\mathcal{G}_1$. It is immediate to notice that contracting each of $e_1$ and $e_2$
into single vertices results in a xor-graph that is a solution of $X_1$ with fewer
vertices than $\mathcal{G}_1$, hence violating the optimality of $\mathcal{G}_1$. □



Since the optimal xor-graph is connected (Lemma 2.3) and consists of a
set of edge-disjoint cycles (Lemma 3.1), the size of the optimum solution is
equal to $|X| + 1 - |\mathcal{C}|$, for $|X|$ the number of genotypes or edges of the graph
and $|\mathcal{C}|$ the number of simple cycles of the graph, since any set of $|\mathcal{C}|$ simple
cycles on a graph with at most $|X| - |\mathcal{C}|$ vertices must share at least an edge.

Algorithm 2 solves the PPXH($\infty$, 2) problem by computing the set $\mathcal{C}$ of
all simple cycles of an optimal xor-graph. In fact Lemma 3.1 allows us to
introduce a binary relation $R$ between genotypes, where two genotypes are
related if and only if they share a common character. By Lemma 3.1 any
two genotypes (or edges of the xor-graph) that are related must also belong
to the same simple cycle. It is immediate to notice that the partition of
the edges of the xor-graph into simple cycles is equal to the most refined
partition of edges such that any two edges sharing a common character
belong to the same set of such partition. In fact Algorithm 2 computes
exactly the closure of $R$.
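The closure computation and the final assembly of the cycles can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: it assumes genotypes are Python sets in which every character occurs in exactly two genotypes, and the names `sharing_classes` and `haplotypes_from_classes` are hypothetical. Each class is laid out as a cycle through the null haplotype by taking prefix xors of its genotypes.

```python
def sharing_classes(genotypes):
    """Partition the genotypes into the classes of the closure of 'shares a character'."""
    remaining = [frozenset(g) for g in genotypes]
    classes = []
    while remaining:
        current = [remaining.pop()]
        grew = True
        while grew:
            # Genotypes not yet in the class that share a character with it.
            related = [g for g in remaining if any(g & c for c in current)]
            grew = bool(related)
            current.extend(related)
            remaining = [g for g in remaining if g not in related]
        classes.append(current)
    return classes


def haplotypes_from_classes(classes):
    """Turn every class into a cycle; all cycles share the null haplotype as a vertex."""
    haplotypes = {frozenset()}
    for cls in classes:
        prefix = frozenset()
        # The last genotype of the class closes the cycle back to the null haplotype.
        for genotype in cls[:-1]:
            prefix ^= genotype
            haplotypes.add(prefix)
    return haplotypes
```

Under the stated assumption the number of haplotypes produced is the number of genotypes plus one, minus the number of classes, in agreement with the size of the optimum stated above.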



Algorithm 2: The algorithm for PPXH($\infty$, 2)

Data: a set $X$ of reduced xor-genotypes
Result: an optimal solution $H$ for $X$

$\mathcal{C} \leftarrow \emptyset$;
while $X \neq \emptyset$ do
    $x \leftarrow$ any element of $X$;
    $C \leftarrow \{x\}$;
    repeat
        $D \leftarrow$ the set of genotypes in $X \setminus C$ sharing at least one
            element with some genotype in $C$;
        $C \leftarrow C \cup D$;
    until $D = \emptyset$;
    Add $C$ to $\mathcal{C}$;
    Remove all genotypes in $C$ from $X$;
end
foreach $C \in \mathcal{C}$ do
    Transform $C$ into a cycle;
end
Build a graph from $\mathcal{C}$, so that all cycles in $\mathcal{C}$ share a common vertex;
return $H$

3.2 A Polynomial Time Algorithm for PPXH(2, $\infty$)

For simplicity's sake we will assume that the instance of the problem is a
genotype matrix $X$ and the desired output is a haplotype matrix $H$.

We recall that in both matrices the columns are indexed by characters,
therefore we will denote by $X[\cdot, \sigma]$ (respectively $H[\cdot, \sigma]$) the column
of $X$ (resp. $H$) indexed by the character $\sigma$. The algorithm is based on
Lemmas 2.4 and 2.6.

In fact we will first compute a largest set $S_1$ of independent characters
in $X$. Moreover for each character $a \in S \setminus S_1$ we determine the subset $S_a$
of $S_1$ such that $X[\cdot, a] = \bigoplus_{\sigma \in S_a} X[\cdot, \sigma]$. Notice that this step can be carried
out by a simple application of the Gauss-elimination algorithm.

Let $X'$ be the submatrix $X[\cdot, S_1]$. An optimal solution of the instance
$X'$ is the matrix $H'$ containing $|S_1| + 1$ rows. More precisely the $i$-th row of
$H'$, for $1 \le i \le |S_1|$, consists of all zeroes, except for the $i$-th column (where
it contains 1). The last row contains only zeroes.

Clearly $H'$ resolves $X'$. In fact it is immediate to notice that each row
of $X'$ contains at most two 1s, as the same property holds for $X$, therefore
for each row $r$ of $X'$ there are two rows of $H'$ resolving $r$. The optimality
of such solution is a direct consequence of Lemma 2.6. Clearly $H'$ is not a
feasible solution of the original instance $X$, but such a feasible solution $H$
can be easily computed from $H'$ by adding, for each character $a \in S \setminus S_1$, a
column equal to $\bigoplus_{\sigma \in S_a} H'[\cdot, \sigma]$ (where $S_a$ satisfies $X[x, a] = \bigoplus_{\sigma \in S_a} X'[x, \sigma]$
for each genotype $x$). The matrix $H$ is a feasible solution as shown in the
proof of Lemma 2.4.
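The matrix $H'$ described above is simply an identity matrix with an extra all-zero row. The sketch below builds it and checks resolution on a small instance; it covers only the independent-column submatrix $X'$, and the helper names `unit_haplotype_matrix` and `resolves_matrix` are introduced here for illustration.

```python
def unit_haplotype_matrix(m):
    """Rows 1..m are unit vectors; the last row is the null haplotype."""
    rows = [[1 if j == i else 0 for j in range(m)] for i in range(m)]
    rows.append([0] * m)
    return rows


def resolves_matrix(H, X):
    """Check that every row of X is the xor of two distinct rows of H."""
    pair_xors = {
        tuple(a ^ b for a, b in zip(H[i], H[j]))
        for i in range(len(H))
        for j in range(i + 1, len(H))
    }
    return all(tuple(row) in pair_xors for row in X)
```

A row with a single 1 is the xor of a unit row and the null row, and a row with two 1s is the xor of two unit rows, which is why the restriction to at most two characters per genotype matters: a row with three 1s is not resolved by this matrix.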



4 Fixed-Parameter Tractability of PPXH 

As observed at the end of Section 2, the reduction of an instance of the
unrestricted PPXH problem leads us to a fixed-parameter algorithm (where
the parameter is the optimum). Moreover we can observe that there exists
another fixed-parameter algorithm for the unrestricted PPXH problem
that does not use the reduction of the input instance. Let $H$ be a set of haplotypes
and let $X$ be a set of genotypes resolved by $H$; in the following we will
denote $|X|$ by $n$. Since $H$ can resolve at most $\binom{|H|}{2}$ genotypes, $n \ge |H| \ge \sqrt{2n}$.
In other words, if $k$ is the size of the minimum-cardinality set of haplotypes
resolving $X$, $n \in O(k^2)$.

The number of the possible graphs with at most $n + 1$ vertices and exactly
$n$ edges is no more than $2^{2n \log_2(n+1)} = (n + 1)^{2n}$ which, by our previous
observation, is $k^{O(k^2)}$, i.e. a function dependent only on $k$. The time needed
to check if one of such graphs is a xor-graph for $X$ is clearly polynomial in
$n$ and thus we can immediately derive a fixed-parameter algorithm to find
an optimal xor-graph for $X$.
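The counting argument relating $n$ and $k$ can be made concrete: $k$ haplotypes form $\binom{k}{2}$ unordered pairs, so at least the smallest $k$ with $\binom{k}{2} \ge n$ haplotypes are needed. The following small sketch computes that bound; the function name is hypothetical and the bound is only the pair-counting one, not the stronger bound of Lemma 2.6.

```python
from math import comb


def pair_counting_lower_bound(n):
    """Smallest k such that k haplotypes have at least n unordered pairs."""
    k = 2
    while comb(k, 2) < n:
        k += 1
    return k
```

For the seven genotypes of the example in Section 2 this gives 5, since four haplotypes form only six pairs.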

The time complexity of the algorithm is well beyond what is deemed 
acceptable in practice, therefore we propose a more efficient algorithm that 
is based on the matrix representation of genotypes and haplotypes. 

In the following we will assume that the genotype matrix $X$ is reduced,
that $X$ has $n$ rows and $m$ independent columns, and that we are looking
for a haplotype matrix $H$ with at most $k$ distinct rows that resolves $X$. The
basic idea of our algorithm is to enumerate all possible haplotype matrices.

In the naive approach, testing if a haplotype matrix resolves a given
genotype matrix requires $O(k^2 nm)$ time because each pair of haplotypes has
to be considered and then each resulting genotype has to be searched in the
genotype matrix. Our strategy, instead, is to enumerate all the haplotype
matrices by changing only one haplotype each time, in such a way that only
$k - 1$ new pairs of haplotypes must be considered when testing if $H$ resolves
the set $X$.

We use Gray codes to visit all the haplotype matrices in such a
way that each pair of consecutive matrices differs by a single bit and, thus, 
by a single haplotype. More precisely, we enumerate all $k \times m$ matrices by
generating all $km$-long bit vectors. Indeed, the bits from position $(i - 1)m + 1$
to position $im$ in a $km$-long vector give the $i$-th row of the matrix (for
$1 \le i \le k$). The fastest known algorithm for computing the next vector of
a Gray code requires constant time for each invocation [1]. 
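The enumeration order can be illustrated with the binary reflected Gray code, in which the bit flipped at step $t$ is the position of the lowest set bit of $t$. The sketch below is an assumption-laden illustration rather than the constant-time generator cited above: positions are 0-based, the same matrix object is mutated in place, and the names `gray_code_flips` and `haplotype_matrices` are hypothetical.

```python
def gray_code_flips(num_bits):
    """Yield the index of the bit flipped at each step of the reflected Gray code."""
    for step in range(1, 1 << num_bits):
        # The lowest set bit of the step counter identifies the bit to flip.
        yield (step & -step).bit_length() - 1


def haplotype_matrices(k, m):
    """Yield (matrix, changed_row) for all 2^(k*m) binary k x m matrices.

    The matrix is updated in place; changed_row is None for the initial
    all-zero matrix and otherwise the index of the only row that changed.
    """
    matrix = [[0] * m for _ in range(k)]
    yield matrix, None
    for pos in gray_code_flips(k * m):
        row, col = divmod(pos, m)   # bits row*m .. row*m + m - 1 form one row
        matrix[row][col] ^= 1
        yield matrix, row
```

Because consecutive matrices differ in a single row, only the pairs involving that row need to be re-examined after each step. Callers that want to keep a matrix must copy it, since the generator reuses the same list.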

Observe that the naive algorithm requires $O(nm)$ time to test if there is
a genotype in matrix $X$ resolved by a pair of haplotypes. By representing
the set of the row vectors of matrix $X$ as a binary trie [9], the time required
to get the index of the row containing an $m$-long binary vector is reduced to
$O(m)$.

The details of the fixed-parameter algorithm are given in Algorithm 3,
where we also use some additional data structures: the array
ResolvedByHowMany, which associates with each genotype the number of pairs
of haplotypes resolving such genotype, and ListResolvedGenotypes, which
associates with each haplotype $h$ a list of the relevant pairs of haplotypes in
which $h$ is involved. In fact, the elements of the lists in ListResolvedGenotypes
are triples $(h_1, h_2, x)$ where $(h_1, h_2)$ is a pair of haplotypes resolving
$x$.

Notice that the outermost foreach loop (lines 6-25) iterates $2^{km}$ times,
which is at most $2^{k^2}$ since $m < k$ by Lemma 2.6,
while the for loop at lines 15-21 iterates $k$ times. Each iteration of the latter
loop consists of a lookup in a trie (which can be done in $O(m)$ time) and
updating in constant time some arrays and lists. Since each list can contain
at most $k$ elements, the time required for each iteration of the outermost
loop is $O(km)$, resulting in an overall $O(nm + 2^{k^2} km)$ time complexity.

5 An Approximation Algorithm 

We present a simple approximation algorithm, detailed as Algorithm 4,
which guarantees for a reduced instance $X$ of PPXH an approximation factor
$l$, where $l$ is the maximum number of xor-genotypes in which a character
appears.

Initially the set $H$ of haplotypes computed by the algorithm contains
only the null haplotype. While the set of genotypes is not empty, pick a
character $a$ that appears in at least one genotype, move to $H$ all genotypes
containing $a$, and remove from $X$ all genotypes that are resolved by a pair
of haplotypes in $H$. Clearly the final set of haplotypes $H$ resolves the set of
genotypes $X$.

The proposed algorithm returns a solution of size at most $\ell$ times larger
than the optimum which, by Lemma 2.6, is at least $|\Sigma| + 1$. Our algorithm
starts with a partial solution H containing only the null haplotype, and
at each iteration adds at most $\ell$ haplotypes to the solution, as $\ell$ is the
maximum number of genotypes containing any character. Since there can
be at most $|\Sigma|$ steps, $|H| \le \ell |\Sigma| + 1$.

Clearly the approximation ratio is at most $(\ell |\Sigma| + 1) / (|\Sigma| + 1) \le \ell$,
completing the proof.
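The approximation algorithm can be sketched in Python as follows. This is a minimal sketch under our own assumptions: genotypes are encoded as nonzero integers whose bit a is 1 when character a belongs to the genotype, the null haplotype is the integer 0, and the function names approx_ppxh and resolves are hypothetical. The sketch follows the listing of Algorithm 4, where every genotype containing the chosen character is moved to H.

```python
def approx_ppxh(genotypes, num_chars):
    """Return a set of haplotypes (as integers) resolving all genotypes."""
    remaining = set(genotypes)
    haplotypes = {0}  # start from the null haplotype
    for a in range(num_chars):
        bit = 1 << a
        picked = {x for x in remaining if x & bit}
        haplotypes |= picked  # each picked genotype x is resolved by (x, 0)
        remaining -= picked
    return haplotypes


def resolves(haplotypes, genotypes):
    """Check that every genotype is the xor of two distinct haplotypes."""
    hs = list(haplotypes)
    pairs = {h1 ^ h2 for i, h1 in enumerate(hs) for h2 in hs[i + 1:]}
    return all(x in pairs for x in genotypes)


example = [0b011, 0b110, 0b101, 0b100]
solution = approx_ppxh(example, num_chars=3)
```

Since each genotype x added to H is resolved by the pair formed by x and the null haplotype, the returned set always resolves the input, and its size never exceeds the number of genotypes plus one.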

6 Solving PPXH by a Heuristic Method 

In this section we propose a heuristic algorithm to build a near-optimal
xor-graph for an input matrix X of genotypes. Observe that an optimal
xor-graph for X is a graph having a minimum-cardinality vertex set and
where each edge is uniquely labeled by a genotype of X. By Lemma 2.1, a
cycle of the xor-graph consists of a subset X' of the input genotypes such
that $\oplus X' = 0$. Consequently we will call a subset X' with $\oplus X' = 0$ a
candidate cycle.
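The candidate cycle condition is easy to check. The Python sketch below assumes, for illustration only, that genotypes and haplotypes are encoded as integers (one bit per character); the function name is_candidate_cycle is hypothetical. The example also shows why the genotypes labeling a cycle of an xor-graph always xor to zero: every haplotype on the cycle appears in exactly two of the edge labels.

```python
from functools import reduce
from operator import xor


def is_candidate_cycle(genotypes):
    """Return True when the xor of all the given genotypes is the zero vector."""
    return reduce(xor, genotypes, 0) == 0


# Three haplotypes forming a triangle in an xor-graph (illustrative values).
h1, h2, h3 = 0b000, 0b011, 0b101
cycle_labels = [h1 ^ h2, h2 ^ h3, h3 ^ h1]
```

Here is_candidate_cycle(cycle_labels) holds, whereas dropping any one of the three labels breaks the condition.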

The basic idea that guides our heuristic is first to select a subset of the 
candidate cycles of X and then to build a labeled graph (a xor-graph) where 
the selected candidate cycles are actual cycles. The procedure successively 
iterates over the genotypes that are not yet successfully realized in the xor- 
graph. 

A related problem is the one called Graph Realization (GR) [20], which
consists of building a graph given its fundamental cycles. We recall that the
set C of fundamental cycles of a graph G with respect to a fixed spanning tree
T of G is defined as $C = \{\text{the unique cycle of } T \cup \{e\} \mid e \in E(G) \setminus E(T)\}$
(see e.g. [7], p. 26). More precisely, the Graph Realization problem can be
formally stated as follows [20]. Given two disjoint sets T and C, the input
of the GR problem is a family F of subsets of $T \cup C$ such that (i) for each
set $F_i$ of the family F, $F_i \cap C = \{c_i\}$, and (ii) for each pair of subsets $F_i$ and
$F_j$ of F, $F_i \cap F_j \cap C = \emptyset$. The GR problem consists of finding a labeled
graph $G = (V, E)$ (if such a graph exists) which realizes F, that is, there is
a bijection between the set T and a spanning tree of G, and the elements of
each set $F_i$ label exactly the edge set of a (simple) cycle of G.

In the case that we have selected a set of candidate cycles which are
fundamental cycles of a graph G, an immediate application of any algorithm
solving the GR problem (two almost linear time algorithms exist [5, 10])
gives a xor-graph resolving all those candidate cycles. We have been inspired
by those algorithms for GR to develop our heuristic. We denote by G(F) a
graph realization G of a family of sets F.

The heuristic procedure transforms an $r \times c$ genotype matrix X into an
instance of GR as described in the following two main steps. In the first
step, the set T is defined as a maximal subset $T = \{x_{j_1}, \ldots, x_{j_c}\}$ of
linearly independent input genotypes of X. This means that any other input
genotype $x_i$ can be expressed as a linear combination $\alpha_{i,1} x_{j_1} \oplus \cdots \oplus \alpha_{i,c} x_{j_c}$
of the genotypes in T. Then $C = \{c_1, \ldots, c_{r-c}\}$ is defined as
the set of genotypes not in T. In the second step, the family of subsets of
$T \cup C$ giving an instance of GR is built by constructing sets $F_i$ such that
$F_i = \{c_i\} \cup \{x_{j_l} \in T \mid \alpha_{i,l} = 1\}$. Informally, $F_i$ consists of $c_i$ and the
unique set $P_i \subseteq T$ such that $\oplus P_i = c_i$. An immediate consequence of our
definition is that $\oplus F_i = 0$, therefore $F_i$ is, by definition, a candidate cycle.

Computing the set T from X is simply a matter of running the Gauss
elimination algorithm on $X^T$ (that is, the transpose of X). The
family F can be easily inferred by computing the coefficients $\alpha_{i,1}, \ldots, \alpha_{i,c}$
for all $c_i \in C$, that is, by solving over GF(2) the linear system
$M (\alpha_{i,1}, \ldots, \alpha_{i,c})^T = c_i$, where the unknowns $\alpha_{i,1}, \ldots, \alpha_{i,c}$ are the
coefficients of the linear combination and the binary matrix M is a matrix
whose columns are the xor-genotypes in T.

Clearly, the Gauss elimination procedure applied to the matrix $X^T$
results in a matrix R whose first c columns form the identity matrix while the
other columns are the vectors of the linear combination coefficients.
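The computation of T and of the candidate cycles can be sketched in Python as follows. This is not a literal Gauss elimination on the transposed matrix: as an assumption made for brevity, the sketch uses an equivalent incremental elimination over GF(2) on genotypes encoded as integers, and it records for each dependent genotype which members of T xor to it. The function name independent_subset is hypothetical.

```python
def independent_subset(genotypes):
    """Split genotypes into an independent subset T and dependent ones.

    Returns (t_idx, combos): t_idx lists the indices of the genotypes in T,
    and combos maps the index of every other genotype to the indices of the
    members of T whose xor equals that genotype.
    """
    basis = {}   # pivot bit -> (reduced vector, bitmask over positions in t_idx)
    t_idx = []
    combos = {}
    for i, x in enumerate(genotypes):
        v, used = x, 0
        while v:
            pivot = v.bit_length() - 1
            if pivot not in basis:
                break
            basis_vector, basis_mask = basis[pivot]
            v ^= basis_vector
            used ^= basis_mask
        if v:
            # x is independent of the current T: it becomes a new member.
            position = len(t_idx)
            t_idx.append(i)
            basis[v.bit_length() - 1] = (v, used ^ (1 << position))
        else:
            # x is the xor of the members of T selected by `used`.
            combos[i] = [t_idx[j] for j in range(len(t_idx)) if (used >> j) & 1]
    return t_idx, combos


example = [0b011, 0b110, 0b101, 0b100]
t_idx, combos = independent_subset(example)
```

In this example the third genotype is the xor of the first two, so together they form a candidate cycle, while the remaining three genotypes are independent and make up T.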

We have a final hurdle, that is, handling the case where a Graph Realization
does not exist for the family F. Once the family F is identified, the heuristic
computes a maximal subfamily F' of F such that F' admits a Graph
Realization. Now, let us detail the construction of the family F' giving an
instance of GR. The heuristic starts by defining F' as an empty family and
iteratively adds to F' a candidate cycle $F_i$ if and only if the resulting family
admits a Graph Realization. Clearly, this approach ends with a maximal subset
of candidate cycles that admits a Graph Realization. The two steps of the
heuristic procedure are then recursively iterated on the set of xor-genotypes
of X that do not label an edge of the computed Graph Realization. The
details of the procedure are presented in Algorithm 5.

Let n and m be, respectively, the number of xor-genotypes and sites.
The time complexity of the heuristic is determined by the time complexity
of the Gauss elimination algorithm, which requires $O(n^2 m)$ time because
it is called on matrix $X^T$, and of the Graph Realization algorithm, whose
best time complexity is $O(\alpha(n, m) nm)$, where $\alpha$ is the inverse Ackermann
function. Notice that the Graph Realization algorithm is repeated at most
n times in order to compute a maximal subfamily F', hence this computation
takes $O(\alpha(n, m) n^2 m)$ time. Finally, at each recursive call at least one
xor-genotype of X labels an edge of the Graph Realization, hence the total
number of iterations is at most n, leading to an overall time complexity of
$O(\alpha(n, m) n^3 m)$.

6.1 Experimental Results 

We have implemented our heuristic as a C program using the software
GREAL [1] as a routine to solve the Graph Realization problem. The GREAL
program implements the algorithm of Gavril and Tamari [11]. Although its time
complexity is $O(nm^2)$ (as opposed to the $O(\alpha(n, m) nm)$ time complexity of
the best known algorithm), it is still effective for our purposes.

The experimental analysis of our heuristic is composed of two parts. 
In the first part we have applied the algorithm on synthetic instances to 
evaluate the quality of the results in terms of cardinality of the solutions 
and running time. In the second part we have assessed the applicability of 
the heuristic to some real-world large instances. 

6.1.1 Synthetic Data 

Each synthetic instance has been created starting from a set of initial
haplotypes, and then each xor-genotype has been generated as the combination
of two haplotypes randomly selected from the initial set. Notice that such a
process does not guarantee that every haplotype is selected to form a genotype.

We have used two different methods to generate the set of initial hap- 
lotypes: (a) pure random generation, and (b) generation under the neutral 
model. The first strategy, pure random generation, selects uniformly sets 
of h distinct haplotypes from the set of all binary haplotypes of length m. 
The second strategy, generation under the neutral model, uses Hudson's
standard simulator ms [14] to generate a sample of h haplotypes assuming
the neutral model of genetic variation. In this case, the sample of haplotypes
can contain repeated elements. Using two different methods to generate the 
set of initial haplotypes allows us to verify if the behavior of the heuristic is 
influenced by the choice of the initial haplotypes. 
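The pure random generation strategy can be sketched in Python as follows. This is an illustrative sketch, not the generator used in our experiments: haplotypes are encoded as m-bit integers, the two haplotypes forming a genotype are assumed to be distinct, repeated genotypes are not filtered out, and the names random_instance, n, h, m, and seed are our own.

```python
import random


def random_instance(n, h, m, seed=0):
    """Generate n xor-genotypes from h distinct random haplotypes of length m.

    Returns (genotypes, used), where `used` is the set of initial haplotypes
    that were actually selected to form at least one genotype.
    """
    rng = random.Random(seed)
    pool = set()
    while len(pool) < h:  # requires h <= 2**m
        pool.add(rng.getrandbits(m))
    haplotypes = sorted(pool)
    genotypes, used = [], set()
    for _ in range(n):
        h1, h2 = rng.sample(haplotypes, 2)  # two distinct haplotypes
        genotypes.append(h1 ^ h2)
        used.update((h1, h2))
    return genotypes, used


genotypes, used = random_instance(n=100, h=25, m=66, seed=1)
```

The size of the returned set used corresponds to the number of distinct initial haplotypes selected to generate the instance, which may be smaller than h.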

The evaluation criteria, in both cases, were (a) the number of distinct
haplotypes computed by our method, and (b) its running time. In particular,
we have considered as the main indicator of the quality of the solutions the
ratio (r) between the number of distinct haplotypes of the computed solution
and the number of distinct initial haplotypes selected to generate a genotype
of the instance. We note that r is only a proxy for the actual approximation
ratio (that is, the ratio between the number of distinct computed haplotypes
and the size of an optimal solution) achieved by the algorithm, as the number
of the selected haplotypes represents only an upper bound on the optimum;
thus the ratio r might be strictly less than 1.

Since the outcome of our heuristic can be influenced by the order of the 
input genotypes, for each instance we have run the algorithms on ten random 
permutations of the genotypes, and we have retained only the smallest set of 
computed haplotypes. The running time refers to the total time required by 
the heuristic on the 10 permutations of genotypes and has been measured 
on a standard PC with 1GB of memory with CentOS Linux 5. 

The pure random generation strategy is characterized by three parameters,
namely the number of input genotypes (n), the number of haplotypes
(h), and the number of characters (m). We have considered 4 different
values of the parameter n (100, 200, 300, 400), and we have computed the
values of h and m from n: in fact those values are n/4, n/3, and 2n/3. The
maximum size of the test instances (400 genotypes and 266 characters) has
been chosen in such a way that repeated tests on several instances of the 
same size would be feasible on a normal computer. In fact, as discussed 
below, on average the heuristic required roughly an hour on the largest in- 
stances, therefore any further increase of the instance size would have made 
the experimentation impractical. 

Table 2 reports the average size of the solutions computed by our heuristic,
its average running time, and the average ratio r on 10 random instances
generated for each choice of the parameters n, h, and m.

The second strategy, generation under the neutral model, is characterized
by the three parameters n, m, and $\rho$, where n is the number of genotypes, m
is the number of characters, and $\rho$ is the crossover (or recombination) rate of
Hudson's program. The size of the initial sample of haplotypes has been
set equal to the number n of genotypes. Since the sample can contain several
copies of the same haplotype, the number of distinct haplotypes randomly
selected to form a genotype has been significantly lower than the number of
genotypes for almost all of the generated instances.

We considered 30 instances for each choice of the parameters $(n, m, \rho)$
with $n \in \{50, 75, 100\}$, $m \in \{50, 75, 100\}$, and $\rho \in \{0, 8, 16, 24\}$. As for the
previous dataset, Table 3 reports the average size of the solution computed
by our heuristic, its average running time, and the average ratio r.

On both datasets the heuristic produces comparable results. In particular,
the average ratio is never larger than 1.57, while quite often it is close
to 1. In other words, it can often reconstruct a solution of size similar to the
number of the haplotypes used to generate the instance and, in the worst
case, the computed solution is at most 1.57 times larger than the set of initial
haplotypes. The ability to compute a good approximation seems affected by
two combined factors: the number of independent characters of the genotype 
matrix and the number of initial haplotypes. Indeed in both tables we can 
observe that the smaller the number of independent characters compared to 
the number of initial haplotypes, the worse is the computed solution. Con- 
versely good solutions are computed by the heuristic when the number of 
independent characters is close to the number of initial haplotypes. 

Lemma 2.6 offers a possible explanation for such regular behavior of our
heuristic. In fact, let H be the set of initial haplotypes of an instance X
and suppose that they are defined on a set $\Sigma$ of independent characters such
that $|H| = |\Sigma| + 1$ (i.e. H is also a solution that meets the lower bound of
Lemma 2.6). Then, the set T computed during step 1 of the heuristic
algorithm contains exactly $|\Sigma|$ independent xor-genotypes. As a consequence,
the set C computed in the same step admits a Graph Realization and, thus,
the heuristic solves optimally the instance X. Although this is not the
general case, our intuition suggests that, when the number of independent 
characters is close to the number of initial haplotypes, the selection of the 
set T is constrained and the Graph Realization of the maximal subset of C 
computed by the heuristic is similar to the xor-graph associated with the 
initial haplotypes. Conversely, if the number of independent characters is 
significantly lower than the number of initial haplotypes, there are a lot 
of degrees of freedom in the choice of the set T, thus the output of the 
Graph Realization step can vary greatly from the xor-graph of the initial 
haplotypes. 

The time required by the heuristic to compute a solution to the pure- 
random synthetic instances varies between circa 25 seconds on instances 
with 100 genotypes and 70 minutes on instances with 400 genotypes. All 
the instances generated using the neutral model, instead, have been solved in 
less than 1 minute. We also observe that instances where the heuristic fails to 
find a good solution have been solved considerably faster than the ones where 
the heuristic computes a good approximation. However, a more careful 
analysis suggests that such fluctuations are due to the different amount of 
I/O operations needed to communicate with the GREAL software that we use 
to solve the Graph Realization problem. 

Finally we tried to compare our heuristic method with the ILP formulation
proposed by Brown and Harrower [6]. In the paper, they formulate
the PPXH problem as a polynomial-size integer linear program and they
introduce cuts and modifications of the objective function that should help
in finding the optimal solution. However, the GLPK solver [17], using the basic
formulation as well as the augmented formulations, was not able to find a
feasible solution even for the smallest instances of our experimentation (50
genotypes and 50 characters) within the maximum time of 24 hours.

6.1.2 Real Data

To validate the feasibility of applying our heuristic to real data, we have
produced some instances from the Phase I dataset of the HapMap project
[19] (release 2005-06_16c.1). A set of xor-genotypes was produced from the
data for each population in the dataset (discarding non-biallelic sites and
non-autosomal chromosomes). Those instances vary from 44 genotypes and
184604 sites to 90 genotypes and 91812 sites. On average, an instance
contains 67 genotypes and 46906 sites. On all those instances our heuristic has
never required more than 2 seconds on the same PC used in the experimental
part over synthetic instances, clearly establishing that the heuristic can
be successfully used on real-world large instances.

7 Conclusions and Future Work 

In the paper we investigate the problem of resolving xor-genotypes under 
the pure parsimony model. We give several results regarding the efficient 
solution of the problem by considering fixed-parameter algorithms or by re- 
stricting the instances of the problem. Most of the results are based on 
combinatorial properties of a graph representation relating a feasible solu- 
tion to the instance: the xor-graph. The computational complexity of the 
unrestricted problem is still unknown. Since we show that PPXH($\infty$, 2) and
PPXH(2, $\infty$) have polynomial time algorithms, it would be interesting to
determine the complexity of PPXH($\infty$, 3) and PPXH(3, $\infty$), as these two cases
could delimit polynomial time solvability and intractability of the general 
problem. We believe that the xor-graph could play a crucial role in solving 
these open problems. 

Acknowledgments 

PB, GDV and YP have been partially supported by FAR 2008 grant "Com- 
putational models for phylogenetic analysis of gene variations". PB has 
been partially supported by the MIUR PRIN 2007 Project "Mathematical 
aspects and emerging applications of automata and formal languages".

References 

[1] T. Barzuza. GREAL - software for the graph realization problem. 

[2] T. Barzuza, J. S. Beckmann, R. Shamir, and I. Pe'er. Computational 
problems in perfect phylogeny haplotyping: Xor-genotypes and tag 
SNPs. In Proc. 15th Symp. on Combinatorial Pattern Matching (CPM),
volume 3109 of LNCS, pages 14-31. Springer, July 5-7, 2004. 






[3] T. Barzuza, J. S. Beckmann, R. Shamir, and I. Pe'er. Computational
problems in perfect phylogeny haplotyping: Typing without calling the 
allele. IEEE Transactions on Computational Biology and Bioinformat- 
ics, 5(1):101-109, 2008. 

[4] J. R. Bitner, G. Ehrlich, and E. M. Reingold. Efficient generation of 
the binary reflected Gray code and its applications. Communications 
of the ACM, 19(9):517-521, 1976. 

[5] R. E. Bixby and D. K. Wagner. An almost linear-time algorithm 
for graph realization. Mathematics of Operations Research, 13:99-123, 
1988. 

[6] D. G. Brown and I. M. Harrower. Integer programming approaches to
haplotype inference by pure parsimony. IEEE Transactions on
Computational Biology and Bioinformatics, 3(2):141-154, 2006.

[7] R. Diestel. Graph Theory, volume 173 of Graduate Texts in
Mathematics. Springer-Verlag, Heidelberg, third edition, 2005.

[8] R. Downey and M. Fellows. Parameterized Complexity. Springer Verlag, 
1999. 

[9] E. Fredkin. Trie memory. Communications of the ACM, 3(9):490-499, 
1960. 

[10] S. Fujishige. An efficient PQ-graph algorithm for solving the graph
realization problem. Journal of Computer and System Sciences,
21:63-68, 1980.

[11] F. Gavril and R. Tamari. An algorithm for constructing edge-trees from
hypergraphs. Networks, 13(3):377-388, 1983.

[12] D. Gusfield. Haplotyping as perfect phylogeny: Conceptual framework 
and efficient solutions. In Proc. 6th Ann. Conf. on Research in Com- 
putational Molecular Biology (RECOMB), pages 166-175, 2002. 

[13] D. Gusfield. Haplotype inference by pure parsimony. In Proc. 14th 
Symp. on Combinatorial Pattern Matching (CPM), pages 144-155, 
2003. 

[14] R. R. Hudson. Generating samples under a Wright-Fisher neutral model 
of genetic variation. Bioinformatics, 18(2):337-338, Feb. 2002. 

[15] G. Lancia, M. C. Pinotti, and R. Rizzi. Haplotyping populations by
pure parsimony: Complexity of exact and approximation algorithms.
INFORMS Journal on Computing, 16(4):348-359, 2004.






[16] G. Lancia and R. Rizzi. A polynomial case of the parsimony haplotyping 
problem. Operations Research Letters, 34(3):289-295, 2006. 

[17] A. Makhorin. GLPK - the GNU Linear Programming Kit. 

[18] C. Savage. A survey of combinatorial Gray codes. SIAM Review, 
39(4):605-629, 1997. 

[19] The International HapMap Consortium. A haplotype map of the human 
genome. Nature, 437(7063):1299-1320, 2005.

[20] W. T. Tutte. An algorithm for determining whether a given binary 
matroid is graphic. Proceedings of the American Mathematical Society, 
11(6):905-917, 1960. 

[21] L. van Iersel, J. Keijsper, S. Kelk, and L. Stougie. Shorelines of islands
of tractability: Algorithms for parsimony and minimum perfect
phylogeny haplotyping problems. IEEE Transactions on Computational
Biology and Bioinformatics, 5(2):301-312, 2008.






Algorithm 3: A fixed-parameter algorithm for PPXH

Data: A genotype matrix X defined over a set of m independent
      characters, and an integer k.
Result: a set H of at most k haplotypes resolving X if it exists, No
        otherwise.

 1  if $\binom{k}{2} < n$ or $k < m$ then return No;
 2  if $k > n$ then return $X \cup \{h_0\}$;
 3  Build a trie T that stores the xor-genotypes contained in X;
 4  Let ListResolvedGenotypes be an array of k initially empty lists;
 5  ResolvedByHowMany $\leftarrow$ (0, 0, ..., 0);
 6  TotalResolvedG $\leftarrow$ 0;
 7  foreach binary matrix H in Gray code do
 8      if H is the matrix containing only zeros then continue to the next matrix;
 9      ChangedRow $\leftarrow$ index of the row changed from the previous iteration;
        /* Update state of xor-genotypes resolved by changed haplotype */
10      foreach entry $(h_1, h_2, x)$ of ListResolvedGenotypes[ChangedRow] do
11          Remove $(h_1, h_2, x)$ from ListResolvedGenotypes[$h_1$] and
            ListResolvedGenotypes[$h_2$];
12          ResolvedByHowMany[x] $\leftarrow$ ResolvedByHowMany[x] - 1;
13          if ResolvedByHowMany[x] = 0 then TotalResolvedG $\leftarrow$ TotalResolvedG - 1;
14      end
        /* Look for genotypes resolved by the new haplotype */
15      for r $\leftarrow$ 1 to k do
16          if l is the index returned by the lookup of the vector
            $H[r, \cdot] \oplus H[\mathrm{ChangedRow}, \cdot]$ in T then
17              if ResolvedByHowMany[l] = 0 then TotalResolvedG $\leftarrow$ TotalResolvedG + 1;
18              ResolvedByHowMany[l] $\leftarrow$ ResolvedByHowMany[l] + 1;
19              Add (r, ChangedRow, l) to ListResolvedGenotypes[r] and to
                ListResolvedGenotypes[ChangedRow];
20          end
21      end
22      if TotalResolvedG = n then
            /* all genotypes are resolved */
23          Remove from H all duplicate rows;
24          return H;
25      end
26  end
27  return No;


Algorithm 4: The approximation algorithm

Data: a set X of xor-genotypes over alphabet $\Sigma$
Result: H

1  $H \leftarrow \{h_0\}$;
2  while $\Sigma \neq \emptyset$ do
3      a $\leftarrow$ any character in $\Sigma$;
4      foreach $x \in X$ s.t. $a \in x$ do
5          add to H the genotype x;
6      end
7      Remove from X all genotypes that contain the character a;
8      Remove a from $\Sigma$;
9  end



Algorithm 5: The heuristic Heu(X)

Data: a xor-genotype reduced matrix X
Result: a haplotype matrix H that resolves X

 1  r $\leftarrow$ the number of rows of X;
 2  c $\leftarrow$ the number of columns of X;
 3  if r = c then
 4      return the set consisting of the null haplotype and the canonical
        haplotypes of X;
 5  end
 6  R $\leftarrow$ output of Gauss elimination on $X^T$;
 7  $T = \{x_1, \ldots, x_c\} \leftarrow$ the independent xor-genotypes labeling the first c
    columns of R, and $C = \{x_{c+1}, \ldots, x_r\} \leftarrow$ the set of the remaining
    genotypes;
 8  $F \leftarrow \emptyset$;
 9  for $x_i \in C$ do
10      $Z(x_i) \leftarrow \{x_i\} \cup \{x_j \in T \mid$ the element in row j and column i of R
        is equal to 1$\}$;
11      if $F \cup \{Z(x_i)\}$ admits Graph Realization then
12          $F \leftarrow F \cup \{Z(x_i)\}$;
13      end
14  end
15  Let G(F) be the Graph Realization of F;
16  H $\leftarrow$ the set of vertices of G(F);
17  v $\leftarrow$ a random element of H;
18  Each $h \in H$ becomes $h \oplus v$;    // Now v is the null haplotype
19  Remove from X the genotypes that label any edge of G(F);
    /* The instance X is reduced before the subroutine Heu is
       recursively called, and the general solution is then
       obtained as described in the proof of Lemma 2.4 */
20  return $H \cup$ Heu(X);

Table 2: Results on instances generated using the pure random strategy. For
each choice of the first three columns, 10 random instances were generated.
The column average independent characters reports the average number of
independent characters in the genotype matrix, while the column average
initial haplotypes reports the average number of distinct haplotypes selected
to generate each instance. The last two columns report, respectively, the
average size of the solution computed by our heuristic and the average ratio r.

genotypes  generated     characters  average      average     average  average
n          haplotypes h  m           independent  initial     result   ratio r
                                     characters   haplotypes  size
---------  ------------  ----------  -----------  ----------  -------  -------
100         25            25          23.70        25.00       25.90    1.04
100         25            33          24.00        25.00       25.00    1
100         25            66          24.00        25.00       25.00    1
100         33            25          25.00        32.90       51.60    1.57
100         33            33          31.50        32.80       33.20    1.01
100         33            66          32.00        33.00       33.00    1
100         66            25          25.00        63.00       87.30    1.39
100         66            33          33.00        63.30       87.20    1.38
100         66            66          62.70        63.90       63.80    1
200         50            50          48.70        50.00       50.90    1.02
200         50            66          49.00        50.00       50.00    1
200         50           133          49.00        50.00       50.00    1
200         66            50          50.00        65.80       96.20    1.46
200         66            66          64.50        65.90       66.20    1
200         66           133          64.80        65.80       65.80    1
200        133            50          50.00       126.90      185.80    1.46
200        133            66          66.00       126.20      186.10    1.47
200        133           133         126.60       128.10      127.70    1
300         75            75          73.70        75.00       75.80    1.01
300         75           100          74.00        75.00       75.00    1
300         75           200          73.90        74.90       74.90    1
300        100            75          75.00        99.80      149.80    1.50
300        100           100          98.60        99.90      100.00    1
300        100           200          98.80        99.80       99.80    1
300        200            75          75.00       190.80      285.90    1.50
300        200           100         100.00       191.10      284.60    1.49
300        200           200         188.20       190.30      189.20    0.99
400        100           100          98.50       100.00      100.90    1.01
400        100           133          98.90        99.90       99.90    1
400        100           266          98.90        99.90       99.90    1
400        133           100         100.00       132.80      194.90    1.47
400        133           133         131.40       132.80      133.00    1
400        133           266         131.80       132.80      132.80    1
400        266           100         100.00       254.70      385.30    1.51
400        266           133         133.00       253.60      384.40    1.52
400        266           266         251.90       253.50      252.90    1

Table 3: Results on instances generated using the neutral model. For each
choice of the first three columns, 30 random instances were generated. The
column average independent characters reports the average number of
independent characters in the genotype matrix, while the column average
initial haplotypes reports the average number of distinct haplotypes selected
to generate each instance. The last two columns report, respectively, the
average size of the solution computed by our heuristic and the average ratio r.

genotypes  characters  recombination  average      average     average  average
n          m           rate $\rho$    independent  initial     result   ratio r
                                      characters   haplotypes  size
---------  ----------  -------------  -----------  ----------  -------  -------
 50         50          0             18.5         19.5        19.5     1
 50         50          8             20.13        21.73       22.03    1.02
 50         50         16             22.27        24.77       25.77    1.04
 50         50         24             20.63        24.17       25.23    1.05
 50         75          0             22.1         23.13       23.1     1
 50         75          8             24.63        26          26.27    1.01
 50         75         16             25.6         27.3        27.37    1.01
 50         75         24             25.3         27.63       28.1     1.02
 50        100          0             25.07        26.13       26.07    1
 50        100          8             26.7         27.93       27.8     1
 50        100         16             28.27        29.87       29.73    1
 50        100         24             28.5         30.5        30.2     0.99
 75         50          0             21.77        22.77       22.77    1
 75         50          8             23.1         25.37       26.63    1.05
 75         50         16             24.97        29.77       34.4     1.16
 75         50         24             25.1         31.4        38.33    1.23
 75         75          0             26.17        27.2        27.17    1
 75         75          8             29.9         31.5        31.77    1.01
 75         75         16             30.93        34.63       37.1     1.07
 75         75         24             31.23        35.83       38.83    1.08
 75        100          0             29.5         30.5        30.5     1
 75        100          8             32.87        34.23       34.13    1
 75        100         16             33.67        36.1        36.77    1.02
 75        100         24             36.2         39.73       41.17    1.04
100         50          0             24.33        25.33       25.33    1
100         50          8             27           30.67       36.6     1.2
100         50         16             27.53        32.8        41.8     1.28
100         50         24             27.93        36.1        49       1.36
100         75          0             27.83        28.83       28.83    1
100         75          8             32.2         34.4        36.07    1.05
100         75         16             34.23        38.33       43       1.12
100         75         24             35.23        42.07       50.43    1.2
100        100          0             34.5         35.5        35.5     1
100        100          8             37.37        39.13       40       1.02
100        100         16             39.87        43.27       45.63    1.06
100        100         24             38.6         45.13       52.87    1.17